Multiply-Add Optimized FFT Kernels

Authors

  • HERBERT KARNER
  • MARTIN AUER
  • CHRISTOPH W. UEBERHUBER
Abstract

Modern computer architecture provides a special instruction, the fused multiply-add (FMA) instruction, to perform both a multiplication and an addition operation at the same time. In this paper newly developed radix-2, radix-3, and radix-5 FFT kernels that efficiently take advantage of this powerful instruction are presented. If a processor is provided with FMA instructions, the radix-2 FFT algorithm introduced has the lowest complexity of all Cooley-Tukey radix-2 algorithms. All floating-point operations are executed as FMA instructions. Compared to conventional radix-3 and radix-5 kernels the new radix-3 and radix-5 kernels greatly improve the utilization of FMA instructions resulting in a significant complexity reduction. In general, the advantages of the FFT algorithms presented in this paper are their low complexity, their high efficiency, and their striking simplicity. Numerical experiments show that FFT programs using the new kernels clearly outperform conventional FFT routines, even the best available FFT programs.
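To make the claim that all floating-point operations can be executed as FMA instructions more concrete, the following is a minimal sketch of one way a radix-2 butterfly can be written with six multiply-add operations. It is an illustration only and is not taken from the paper, whose exact formulation may differ. The names `FmaCounter` and `butterfly_fma` are hypothetical, and the helper only emulates an FMA in software (it rounds twice, whereas a hardware FMA rounds once) so that the operation count can be checked. For comparison, a conventional butterfly computing `x0 + w*x1` and `x0 - w*x1` needs 4 real multiplications and 6 real additions.

```python
import cmath


class FmaCounter:
    """Hypothetical stand-in for a hardware FMA: returns a*b + c and counts calls."""

    def __init__(self):
        self.count = 0

    def __call__(self, a, b, c):
        self.count += 1
        return a * b + c  # emulation only: two roundings, unlike a true FMA


def butterfly_fma(x0, x1, w, fma):
    """Radix-2 butterfly (x0 + w*x1, x0 - w*x1) using six multiply-adds."""
    # y0 = x0 + w*x1: two chained multiply-adds per real/imaginary component.
    t_re = fma(-w.imag, x1.imag, x0.real)
    y0_re = fma(w.real, x1.real, t_re)
    t_im = fma(w.imag, x1.real, x0.imag)
    y0_im = fma(w.real, x1.imag, t_im)
    # y1 = x0 - w*x1 = 2*x0 - y0: one multiply-add per component.
    y1_re = fma(2.0, x0.real, -y0_re)
    y1_im = fma(2.0, x0.imag, -y0_im)
    return complex(y0_re, y0_im), complex(y1_re, y1_im)


# Example inputs (arbitrary values chosen for illustration).
fma = FmaCounter()
x0, x1 = complex(1.0, 2.0), complex(3.0, -1.0)
w = cmath.exp(-2j * cmath.pi / 8)  # a twiddle factor of an 8-point transform
y0, y1 = butterfly_fma(x0, x1, w, fma)
```

The second output is obtained from the identity `x0 - w*x1 = 2*x0 - y0`, which is what lets every arithmetic step take the multiply-add form. On a real target one would replace the helper with the platform's fused instruction or a library `fma` routine.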


Similar resources

Fast Radix 2, 3, 4, and 5 Kernels for Fast Fourier Transformations on Computers with Overlapping Multiply-Add Instructions

We present a new formulation of fast Fourier transformation (FFT) kernels for radix 2, 3, 4, and 5, which have a perfect balance of multiplies and adds. These kernels give higher performance on machines that have a single multiply–add (mult–add) instruction. We demonstrate the superiority of this new kernel on IBM and SGI workstations. Key word. FFT kernels AMS subject classifications. 65-04, 4...


A radix-16 FFT algorithm suitable for multiply-add instruction based on Goedecker method

A radix-16 fast Fourier transform (FFT) algorithm suitable for multiply-add instruction is proposed. The proposed radix-16 FFT algorithm requires fewer floating-point instructions than the conventional radix-16 FFT algorithm on processors that have a multiply-add instruction. Moreover, this algorithm has the advantage of fewer loads and stores than either the radix-2, 4 and 8 FFT algorithms or t...


Implementation of FFT Butterfly Algorithm Using SMB Recoding Techniques

Arithmetic operations of high complexity are widely used in Digital Signal Processing (DSP) applications. The FFT algorithms use butterfly method in order to find the output. The Butterfly method includes an addition followed by a multiplication. In this work, we focus on optimizing the design of the fused Add-Multiply (FAM) operator for increasing performance and hence the FFT. Optimization of...


SIMD Vectorization of Straight Line FFT Code

This paper presents compiler technology that targets general purpose microprocessors augmented with SIMD execution units for exploiting data level parallelism. FFT kernels are accelerated by automatically vectorizing blocks of straight line code for processors featuring two-way short vector SIMD extensions like AMD’s 3DNow! and Intel’s SSE 2. Additionally, a special compiler backend is introduc...


Adaptive Dynamic Scheduling of FFT on Hierarchical Memory and Multi-Core Architectures

In this dissertation, we present a framework for expressing, evaluating and executing dynamic schedules for FFT computation on hierarchical and shared memory multiprocessor / multi-core architectures. The framework employs a two layered optimization methodology to adapt the FFT computation to a given architecture and dataset. At installation time, the code generator adapts to the microprocessor...




Publication date: 2015